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1. Introduction 

This paper considers the problem of estimating a multivariate binary density 
from a number of independent observations. That is, we have n observations of 
the form Xi € {0, 1}'' which are independent and identically distributed (i.i.d.) 
samples from a population with a probability density / (with respect to the 
counting measure on the d-dimensional binary hypercube {0,1}''). We wish to 
estimate / on the basis of these observations. Multivariate binary data arise in 
a variety of applications: 

• Bio statistics. Each Xi could represent a biochemical profile of a bacterial 
strain, where every component is a "yes-no" indicator of a presence of 
a particular biochemical marker [18, 38]. A recent paper [31] proposed a 
methodology for representing gene expression data using binary vectors. 
More classical scenarios include recording the occurrence of a given symp- 
tom or a medical condition in a patient over time [12] or outcomes of a 
series of medical tests [1] . 
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• Quantitative methods in social sciences. Each could represent a respon- 
dent in a survey or a panel, where every component is a "yes-no" answer 
to a question [6], describe a voting record of a legislator, or correspond to 
co-occurrences of events in social networks [32] . 

• Artificial intelligence. Each Xi could represent a user query to a search 
engine or a database, where every component corresponds to the presence 
or absence of a particular keyword [13], or an image stored on a website 
like Flickr"'^, where every component corresponds to a user-supplied tag 
from a given list. 

Many situations involving multivariate binary data have the following features: 
(1) the number of covariates (or the dimension of the hypercube) d is such that 
the number of possible values each observation could take (2'^) is much larger 
than the sample size n; (2) there is a "clustering effect" in the population, mean- 
ing that the shape of the underlying density is strongly influenced mainly by 
a small number of constellations of the d covariates. For example, a particular 
class of bacterial strains may be reliably identified by looking at a particular 
subset of the biomarkers; there may be several such classes in the population of 
interest, each associated with a distinct subset of biomarkers. Similarly, when 
working with panel data, it may be the case that the answers to some spe- 
cific subset of questions are highly correlated among a particular group of the 
panel participants, and the responses of these participants to other questions 
are nearly random; moreover, there may be several such distinct groups in the 
panel. 

These considerations call for a density estimation procedure that can effec- 
tively cope with "thin" samples (i.e., those samples for which n < 2"^) in terms 
of both estimation error and computational complexity, and at the same time 
automatically adapt to the possible clustering in the population, in the sense 
described above. We take the minimax point of view, where we assume that the 
unknown density / comes from a particular function class T and seek an esti- 
mator that exactly or approximately attains the minimax mean-squared error 

i?:(J-)=infsupE||7-/||i„ 
/ fey" 

where the infimum is over all estimators based on n i.i.d. samples from /. We 
will choose the class T to model the "constellation effects" via a certain sparsity 
condition. Our choice of the risk, as opposed to other measures of risk more 
commonly used in density estimation, such as Hellinger, KuUback-Leibler or 
total variation risks, is dictated by the fact that the sparsity condition mentioned 
above is most naturally stated in a Hilbert space framework, which in turn 
facilitates the design of our estimation procedure, as well as the derivation of 
both upper and lower bounds on R*n{f). We refer the reader to several other 
works on density estimation that use risk [24, 7, 10, 40, 20, 19, 11]. We 
also note that, because the Euclidean norm (which for probability densities 
gives the total variation risk) dominates the Euclidean norm, and because 

"'^http : //www. f lickr . com 
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the square of the KuUback-Leibler divergence dominates the total variation 
distance [8], lower bounds on the squared risk automatically translate into 
lower bounds on the squared (total variation) risk and on the KuUback- 
Leibler risk. 

Because of the host of applications in which multivariate binary data natu- 
rally arise, several authors have investigated algorithms for estimation of their 
probability densities (see, e.g., [1, 28, 24, 7]). However, existing approaches ei- 
ther have very slow rates of error convergence or are computationally prohibitive 
when the number of covariates is very large. For example, the kernel density es- 
timation scheme proposed by Aitchison and Aitken [1] has computational com- 
plexity 0(nd) (where n is the number of observations and d the number of 
covariates), yet its squared error decays at the rate 0(n-''/ ('*+'')) [33], which 
is disastrously slow for large d. In contrast, orthogonal series methods, which 
can potentially achieve near-minimax error decay rates, require the estimation 
of 2^^ basis coefficients and do not easily admit computationally tractable esti- 
mation methods for very large d. For instance, using the Fast Walsh-Hadamard 
Transform to estimate the coefficients of a density in the Walsh basis (see be- 
low) using n samples requires 0{nd2'^) operations (see Appendix B in [28] and 
references therein). 

In this paper we present a computationally tractable orthogonal series esti- 
mation method based on recursive block thresholding of empirical Walsh coef- 
ficients. In particular, the proposed method entails recursively examining em- 
pirical estimates of whole blocks of the 2'^ different Walsh coefficients. At each 
stage of the algorithm, the overall weight of basis coefficients computed at pre- 
vious stages is used to decide which remaining coefficients are most likely to be 
significant or insignificant, and computing resources are allocated accordingly. 
It is shown that this decision is accurate with high probability, so that insignif- 
icant coefficients are not estimated, while the significant coefficients are. This 
approach is similar in spirit to the algorithm of Goldreich and Levin [17], origi- 
nally developed for applications to cryptography and later adopted by Kushile- 
vitz and Mansour [22, 25] to the problem of learning Boolean functions us- 
ing membership queries. Although there are significant differences between the 
problems of density estimation and function learning which are reflected in our 
estimation procedure, our algorithm inherits the computational tractability of 
the Goldreich-Levin scheme: in particular, it runs in probabilistic polynomial 
time. 

The proposed estimator adapts to unknown sparsity of the underlying density 
in two distinct ways. First, it is near-minimax optimal for "moderate" sample 
sizes d ^ n ^ with an error decay rate of 0(2~'^(d/n)^''/(^''+^)), where 

p G (0, 1] is a measure of sparsity and r = 1/2— 1/ p. Moreover, the computational 
complexity of our algorithm is automatically lower for sparser densities. Sparsity 
has been recently recognized as a crucial enabler of accurate estimation in "big-d, 
small-n" type problems [4, 5] . Specifically for densities on the binary hypercube, 
sparsity in the Walsh basis has a natural qualitative interpretation that the 
shape of the density is infiuenced mainly by a small number of constellations 
of the covariates. For example, if the components of a multidimensional binary 
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vector represent positive/negative outcomes in a series of medical tests, it is 
often the case that the outcomes of certain small constellations of tests play the 
determining role in the diagnosis. 

There are several different series expansions on the binary hypercube pre- 
sented in the literature, including the Rademacher- Walsh orthogonal series (see 
Appendix A in [28] and references therein) and the Bahadur expansion [2, 18]. 
We focus in this paper on the Walsh system, which is derived from Fourier anal- 
ysis on finite groups (see, e.g.. Chap. 4 of Tao and Vu [36]), for two reasons. 
First, the coefficients of a particular function in the Walsh system give us in- 
formation about the influence of the various subsets of the d variables on the 
value of the function [34, 9]. Second, the Walsh functions of a length-d vector 
can be factorized into products of Walsh functions of multiple shorter vectors 
with lengths summing to c?; this is detailed in Section 2.2. This factorization 
is central to the efhciency of the proposed coefficient estimation method. The 
Walsh system is widely used in the context of learning Boolean functions [25] , 
as well as in harmonic analysis of real- valued functions on the binary hypercube 
[34, 9]. 

1.1. Organization of the paper 

The remainder of the paper is organized as follows. Section 2 contains the pre- 
liminaries on notation, the Walsh system, and sparsity classes on the binary 
hypercube. Next, in Section 3 we describe the motivation behind the threshold- 
ing methods in orthogonal series estimation on the binary hypercube, introduce 
our recursive thresholded estimator, and analyze its MSE and computational 
complexity. The theorems of Section 3 are proved in Section 4. Some illustrative 
simulation results are given in Section 5. The contributions of the paper are 
summarized in Section 6. Finally, some technical results are relegated to the 
appendices. 

2. Preliminaries 
2.1. Notation 

The basic set {0, 1} will be denoted by B. For any integer fc > 1, the com- 
ponents of binary strings x £ will be denoted by x^-'^ I < j < k: for 
any x £ B*^, we have x = {x''^\ . . . ,a;'^'^^). We will use juxtaposition to denote 
concatenation of strings: if y e B*^ and z € B', then yz £ B'''+' is the string 
X = {y^^\ . . . , 2/'-'°'', z*-^', . . . , z*-'-*). For any < fc < d, we will define the prefix 
mapping tt^ : B'' ^ B*^ and the suffix mapping crfc : B'' ^ B''^*'' by 

^fe(x)^(a;«,...,xW), afc(:E)^(x('=+i),...,xW), (1) 

so that X = Trk{x)(Tk{x) for any x G B*^ (note that both ttq and return an 
empty string). Whenever we deal with vectors whose components are indexed 
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by the elements of B'^ for some k, we will always assume that the components 
are arranged according to the lexicographic ordering of the binary strings in B*^. 
Given two real numbers a, 6, we let a Ab denote min{a, b}. Also, throughout the 
paper, C is used to denote a generic constant whose value may change from line 
to line; specific absolute constants will be denoted by Ci, C2, etc. 
Throughout the paper, we let M = 2'^. 



2.2. The Walsh system 

For any integer fc > 1, denote by /i^ the counting measure on B*^ and endow 
the space of functions / : B'^' — > M with the structure of the real Hilbert space 
L^(/ifc) via the standard inner product 

(/,.9)^ J2f{x)g{x). 

The Walsh system (see references in the Introduction) in L^(iJ,j,) is an orthonor- 
mal system — {fs '■ s E B*^}, defined by 

<^s(x) ^ 2^(-ir^ VxeB'= (2) 

where s ■ x ^ '^j^is'^^''x^^\ Hence, any / G L^{iik) has the Fourier-Walsh 
expansion 

where Og = {f,ips), s G B'^. To keep the notation simple, we will not explicitly 
mark the underlying dimension when working with the Walsh functions. When 
k = d, we will write $ instead of 

For any fc, the Walsh system $fc is a tensor product basis induced by $1 = 
{ipo, ipi}, where 

(po{x) = and ipi{x) ^ -^(-'^f 

for any a; G B. That is, for any fc > 1, any ips G has the form 

which means that 

fc 

i=l 

This generalizes to the following useful factorization property of the Walsh func- 
tions: for any fc > 1 and any ^ < fc, we have 

^s^ip.,,,is)®^a,Ms)^ VsGB'^ (3) 

where tt; ^ and ai^k denote the prefix and the suffix mappings defined on B*^ 
analogously to (1). This means that, for products of functions on disjoint subsets 
of the d variables, the Fourier- Walsh coefficients also have the product form. 
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2.3. Sparsity and weak-iP balls 

Our interest lies with functions whose Fourier- Walsh representations satisfy a 
certain sparsity constraint. Given a function / on B"^, let 6{f) denote the vector 
of its Fourier-Walsh coefficients. We will assume that the components of d{f) 
decay according to a power law. Formally, let . . . ,0(m)i where M = 2'', be 
the components of 0{f) arranged in decreasing order of magnitude: 

|^(1)|>|0(2)|>...>|^(M)|. 

Given some < p < oo, we say that 9{f) belongs to the Marcinkiewicz, or 
weaMP, ball of radius R [3, 21], and write 0{f) e if 

\Oini) \<R- m-^/P, 1 < m < M. (4) 

It is not hard to show that the Fourier-Walsh coefhcients of any probability 
density on B'* are bounded by I/a/M. With this in mind, let us define the 
function class 

Mp)^[f:B''^R:e{f)ewiPil/VM)} (5) 

We are particularly interested in the case < p < 1. 

We will need approximation properties of weak-£P balls as listed, e.g., in [5]. 
The basic fact is that the power-law condition (4) particularized to the elements 
of J-d(p) is equivalent to the concentration estimate 




valid for all A > 0. Additionally, for any 1 < A: < M, let Ok{f) denote the vector 
9{f) with ^(fe+i), . . . , 0(M) set to zero. Then it follows from (4) that 

mf)-9,if)\\,.^<CM-'^'k-^ (7) 

where r = 1/p — 1/2, and C is some constant that depends only on p. Hence, 
given any / e -Fdip) and denoting by fk the function obtained from it by trun- 
cating all but the k largest Fourier- Walsh coefficients, we get from Parseval's 
identity that 

II/-MIl^(^,) <CA./-i/2fc-^ (8) 

Thus, the assumption that / belongs to the sparsity class Td{p) for some p 
can be interpreted qualitatively as saying that the behavior of / is strongly 
influenced by a small number of subsets of the d covariates. The number of 
these influential subsets decreases as p 0. 
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3. Density estimation via recursive Walsh thresholding 

Let Xi , . . . , Xn be independent random variables in B'' with common unknown 
density /. We wisii to estimate / on the basis of this sample. For densities defined 
on the Euclidean space, nonparametric estimators based on hard or soft thresh- 
olding of empirically estimated coefficients of the target density in a suitably 
chosen basis (e.g., a wavelet basis) attain near-minimax rates of convergence of 
the squared-error risk over rich classes of densities [10, 19]. Thresholding is a 
means of controlling the bias- variance trade-off. 

Several authors have investigated the use of term-by-term thresholding rules 
for density estimation on the binary hypercube. There, one begins by computing 
the empirical estimates 

1 " 

0, = -y^,(X,) (9) 
1=1 

of the Fourier- Walsh coefficients of /, and then forming the thresholded esti- 
mator 

^{T(e,)>A„}^'''/'«' (10) 

where T(-) is some real-valued statistic and /{.} is the indicator function. Based 
on the observation that 

Var^.^i(A_,^.), (U) 

while the squared bias incurred by omitting the term dgipg from the estimator 
(10) is 6*3, Ott and Kronmal [28] considered the ideal thresholded estimator 

/* = X! -^{0J>l/Af(n+l)}^s<Ps- (12) 

Clearly, /* is impractical because the thresholding criterion depends on the 
unknown coefficients 0^. Instead, Ott and Kronmal [28] suggested that one could 
mimic the ideal estimator (12) by replacing &]. in the thresholding criterion by 
the unbiased estimator (nQ'\ — l/M)/{n — 1), leading to the practical estimator 

/WT = y -^{e2>2/M(«+l)}^''V's' (13) 

where WT stands for "Walsh thresholding." This estimator was further im- 
proved by Liang and Krishnaiah [24] and Chen, Krishnaiah and Liang [7], who 
replaced the hard thresholding rule in (13) with shrinkage rules. 

The main disadvantage of such termwise thresholding is the need to compute 
empirical estimates of all M = 2^^ Fourier-Walsh coefficients. While this is not 
an issue when d is comparable to log n, it is clearly impractical when d ^ log n. 
In order to alleviate this difficulty, we will consider a recursive thresholding 
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approach, which will allow us to reject whole groups of empirical coefficients 
based on efficiently implementable thresholding rules. The main idea behind 
this approach is motivated by the following argument. 

Given some 1 < fc < d, we can represent any function / e L^(^d) with the 
Fourier-Walsh coefficients {9s : s E B"*} as 

/ = ^ ^ GuvVuv 

where, for each u e B''', /„ = St,eB''-'= ^uvfv is a function in £'^{^1^^^)- The 
Fourier-Walsh coefficients of /„ are precisely those coefficients of / that are 
indexed by s € B'' with TTk{s) = u. By Parseval's identity, we have 

t)GB<*-'= 

This leads to the following observation: for any A > 0, 

Wu < X for some u e B''' ^ 6l.^ < A for every v £ B'^"''. 

The usefulness of this observation for our purposes comes from the fact that we 
can represent the strings s G B'^, and hence the elements of the Walsh system in 
L^(/Zd), by the leaves of a complete binary tree of depth d. Suppose we wanted 
to pick out only those coefficients of / whose squared magnitude exceeds some 
threshold A. If we knew that Wu < A for some w G B*^, then this would tell us 
that the square of every coefficient corresponding to a leaf descending from u 
does not exceed A. Hence, we could start at the root of the tree and at each 
internal node u that has not yet been visited check whether Wu > A; if not, 
then we would delete u and all of its descendants from the tree without having 
to compute explicitly the corresponding coefficients. At the end of the process 
(i.e., when we get to the leaves), we will be left only with those s e B'^ for which 
0j > A. If / e J^dip) for some p, then the resulting squared error will be 

sGB<* 

where r = 1/p — f/2, as before. 

We will follow this reasoning in constructing our density estimator. We begin 
by developing a suitable estimator for Wu- To do that, we shall rely on the 
following lemma (see Appendix A.l for the proof): 

Lemma 1. For any density f onM"^, any 1 < k < d, and any u £ M'^ , we have 

fu{y) = {^u{7Tk{X))l{„,^x)=y}} ,yy e B^-' 
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and 

Wu = {^„(7rfc(X))/„K(X))} . 

This lemma suggests that, for each 1 < fc < d and each u E M^, an empirical 
estimate of Wu can be obtained by 



— 1 " 

Wu = - V(^„(7rfc(X,)) 
11 ^ — ^ 



n . 



1 " 



^ n 71 

;i XI XI ('^'^ ) ) (^j))h'^k{xo=<yUx,)}- ( 14) 



n 

i=i i=i 



Although this is a biased estimator, it has the following useful property (see 
Appendix A. 2 for the proof): 

Lemma 2. For any 1 < k < d and any u € B*^, 

Wu^ (15) 



»d-fc 



where each 6uv is an empirical estimate of 9uv computed according to (9). 

Another advantage of computing Wu indirectly via (14), rather than (15), is 
that, while the latter requires 0{2'^^^n) operations, the former requires only 
0{n^d) operations. This can amount to significant computational savings when 
k < d — \og{nd). When k > d — \og{nd), it becomes more efficient to use the 
direct estimator (15). 

Now that we have a way of estimating Wu, we can define our density esti- 
mation procedure. Provided the threshold scales appropriately with the sample 
size, we will be able to achieve a good balance between the estimation error 
(variance) and the approximation error (squared bias) and attain near-minimax 
rates of convergence. In our analysis, we shall actually consider a more flexible 
strategy: for every 1 < k < d, we shall compare the estimate Wu of Wu to 
a threshold that depends not only on the sample size n, but also on k. More 
specifically, we will let 

Afc,„ = — , l<fc<d (16) 
n 

where the sequence {ak}f^i satisfies ai > afe > . . . > > 0. In particular, 
this set-up covers the following two extreme cases: 

1. afc = const for all k - this covers the standard case of always comparing 
against the same threshold (that depends on n) 

2. ak — const • 2'^~'' - this corresponds to thresholding not the sum of (a 
particular subset of) the coefficients, but their average. 

As we shall see, this fc-dependent scaling will allow us to flexibly trade off the 
expected error and the computational complexity of the resulting estimator. 
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Now we describe our density estimator. Given the sequence A — {Xk,n}t^i, 
define the set 

A{\) ^ {s e : > Afe,„,Vl <k<d} (17) 

and consider the density estimate 

/rwt — ^ I{seA(x)}(^s'Ps, (18) 

where RWT stands for "recursive Walsh thresholding." To implement this es- 
timator on a computer, we call the routine RecursiveWalsh, shown as Algo- 
rithm 1, with u = (the empty string, corresponding to the root of the tree) 
and with the desired threshold sequence A. The factors of 1/2 in the updates 
for Wug and arise because of the factorization property (3) of the Walsh 
basis functions: for any fc > and any s, x, x' E 8*''+^ we have 



(Pu{x)(pu{x') = (p,,i^(^s)(T^k{x))(p^^(s){Trk{x')) 



'V2 



Algorithm 1 RecursiveWalsh(w, A) 

k <r- length(?i) 
if fc = d then 

n 
1 = 1 

if ^ > ^d,n then 
output u, 6u 

end if 

return 
end if 
uo <— Ou 
ui <— lu 

n rt 

i=l j = l 

if WuQ < >^k+i,n then 

return 
else 

Recursive WALSH(tio, A) 
end if 

■i = l j = l 

if Wui < Afc_|_i „ then 

return 
else 

Recursive Walsh(ui , A) 
end if 
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3. 1 . Analysis of performance 

Let us denote by T^'^{p) the set of all probability densities in J-d{p)- 



Our first main result is that, with appropriately tuned thresholds, the estimator 
(18) adapts to unknown sparsity of the Fourier-Walsh representation of /: 

Theorem 1. There exist absolute constants Ci, C2 > 0, such that the following 
holds. Suppose the threshold sequence A = {Xk}'l^i is chosen in such a way that 
oik > Cid/M for all k, where M = 2'^. Then for all n < 2rM'^''+^ the estimator 
(18) satisfies 

sup E,||/-/RWT||i.(,,)<^M— (20) 

for alio < p < 1, where, as before, r = l/p—1/2. Moreover, ifak > Cidlogn/M 
for all k, then the risk of (18) is bounded as 

Fiif ? ii2 ^ /iogAnogny/(2'-+i) 

for all n. 

Remark 1. Positivity and normalization issues. As is the case with orthogonal 
series estimators, /rwt may not necessarily be a bona fide density. In particular, 
there may be some x £ M'^ such that /rwt (2;) < 0, and it may happen that 
/ /AWT^Md 7^ 1- In principle, this can be handled by clipping the negative values 
at zero and renormalizing; this procedure can only improve the expected 
error. In practice this may be computationally expensive when d is very large. If 
the estimate is suitably sparse, however, the renormalization can be carried out 
approximately using Monte-Carlo estimates of the appropriate sums. Moreover, 
in many applications the scaling of the density is not important. □ 

Remark 2. Logarithmic factors in the risk bound. For each < p < 1, the 
bound (20) wiU hold for all n < 2r]VP''+^. Since 

2^7\i-2r+l J^,j2/p > j^2/p^ < p < 1 

it follows from Theorem 2 below that /kwt with thresholds Ai = . . . = Ad ~ 
nd/M is minimax-optimal for each < p < 1, assuming that the sample sample 
size n satisfies d = logM ^ n ^ M^/p. For very small sample sizes n ^ log M 
and for very large sample sizes n > M^/p, /rwt will be suboptimal. With a 
more conservative choice of thresholds, Ai = . . . = A^ ^ ndlogn/M, the bound 
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(21) will hold for all values of n. In particular, in this case the minimax rate will 
be attained, up to logarithmic factors, for all values of p e (0, 1) simultaneously 
in the moderate sample regime logM^n^M^/P. □ 

Our second main result is a lower bound on the minimax risk attainable 
by any estimator over It shows, in particular, that our recursive esti- 

mator /rwt is minimax for "moderate" sample sizes logM ^ n ^ M^^p. For 
large sample sizes, n >: M^^p, /rwt is no longer optimal — in particular, it is 
outperformed by both the simple histogram estimator 

1 " 

and by the unthresholded orthogonal series estimator 

both of which attain the optimal OiXjn) risk. The precise statement is as follows: 

Theorem 2. Consider the problem of estimating an unknown f G J'd '^ip) from 
n i.i.d. samples Xi, . . . ,X„. Then the following statements hold: 

1. Suppose that logM <n< M'^^^^'^^/p for some e G (0, 1). Then there exists 
a positive constant C = C{p,e), such that 

/" /e^^''(p) ^ ^ ^ 

where, as before, M = 2'^ . 

2. Suppose that n > M'^/p and M > 4. Then there exists an absolute constant 
C > 0, such that 

inf sup ^fWfn- ffm^,)>-. (23) 

Our third, and final, main result bounds the running time of the algorithm used 
for computing /rwt: 

Theorem 3. Fix any f G J-d{p). Given any 5 G (0,1), provided each Uk is 
chosen so that 

nCi(2X„A2'=/V„)>^°^J^, (24) 

log e 

where 

and Ci,C2 > are certain absolute constants, then Algorithm 1 runs in 

o{n'd['^Y" K{c,p)^ (25) 
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time with probability at least 1 — 6, where K{a,p) = Ylt^i '^k^^^ '^^ before, 
M = 2'^. 

Remark 3. Trade-off between time complexity and MSB. By controlling the rate 
at which the sequence ak decays with k, we can trade off MSE against complex- 
ity. Consider the following two extreme cases: (1) ai = . . . = ^ 1/M and (2) 
ak ~ jM. The first case, which reduces to the term- by-term thresholding, 
achieves the same bias- variance trade-off as the Ott-Kronmal estimator [28]. 
However, it has Kitx^p) — 0{MP/^d), resulting in 0(d2n2+P/2) complexity. The 
second case, which leads to a very severe estimator that will tend to reject a lot 
of coefficients, has MSE of 0(n-2'-/(2'-+i)M-i/(2'-+i)), but K{a,p) = 0{Mv/'^), 
leading to a considerably better 0{dn^^P/^) complexity. From the computa- 
tional viewpoint, it is preferable to use rapidly decaying thresholds. However, 
this reduction in complexity will be offset by a corresponding increase in MSE. 
In fact, using RWT with exponentially decaying a^-'s in practice is not advis- 
able as its low complexity is mainly due to the fact that it will tend to reject 
even the big coefficients very early on, especially when d is large. To achieve a 
good balance between complexity and MSE, a moderately decaying threshold 
sequence might be best, e.g., ak ^ {d — k + 1)™/Af for some m > 1. As p — > 0, 
the effect of A on complexity becomes negligible, and the complexity tends to 
0{n'^d). ' □ 

Remark 4. Incoherence. Note that for any of the above choices of ak, the 
proposed method requires polylog(M) operations. One intuitive explanation for 
why such fast computation is possible is that the Walsh basis is "incoherent" (to 
use term common in compressed sensing and group testing literature) with the 
canonical basis of L^{iJ.d)- Similar polylog computation results were achieved by 
Gilbert et al. in the context of fast sparse Fourier approximation [14, 15] and 
group testing [16]. Their strategies also had connections to the Goldreich-Levin 
algorithm [17], as well as to the work of Kushilevitz and Mansour on sparse 
Boolean function estimation [22, 25]. □ 

4. Proofs of the theorems 

In this section we prove our three main results. However, before proceeding 
to the proofs, we collect all the technical tools that we will be using: moment 
bounds, concentration inequalities, and an approximation-theoretic lemma per- 
taining to class T^'^ip). 

4-1. Preliminaries 

4.1.1. Moment bound 

We will need the following result of Rosenthal [30]. Let /7i, . . . , [/„ be i.i.d. ran- 
dom variables with EUi = and EUf < . Then for any m > 2 there exists 



imsart-ejs ver. 2011/11/15 file: binary_hypercube.tex date: December 3, 2012 



Raginsky et al. /Recursive density estimation on the binary hypercube 



15 



some Cm such that 



E 



1=1 



-,m— 1 



(26) 



4- 1.2. Concentration bounds 

We will need the well-known Hoeffding inequality: if Ui,...,Un are i.i.d. 
bounded random variables such that EUi — and \ Ui\ < 6 < oo for all 1 < i < n, 
then 

^ " ' " ' > i I < 2exp 



'2&2 ] 



(27) 



The following result is due to Talagrand [35]. Let Ui,...,Un be i.i.d. ran- 
dom variables, let ei,...,e„ be independent Rademacher random variables 
[i.e., F{ei — —1) = P{ei = 1) = 1/2] also independent of C/i,...,J7„, and 
let be a class of functions uniformly bounded by L > 0. Then if there exist 
some v,H > such that sup g^jr Vai g{U) < v and 



E<^ sup Ve,g(f/,) \ <nH 

Ue^^=l J 



(28) 



for all n, then there are universal constants Ci and C2 such that, for every 
T > 0, 



sup Vnig) > T + C2H ) < exp 



r2 T 



where 



<?([/,) -Eg([/), ygeT 



(29) 



(30) 



is the empirical process indexed by J-. 

Remark 5. Typically, some additional regularity conditions on T are needed 
to ensure measurability of the supremum sup^gjr lynig) of the empirical process 
(30). However, when U takes values in a finite set, as is the case in this paper, 
there is no need for such conditions because any uniformly bounded class of 
real-valued functions on a finite set is separable: it contains a countable subset 
To, such that for any g <E F there exists a sequence gi, 172, • • • G converging 
to g pointwise. Such a separability property ensures measurability of suprema 
over F [37, p. 110]. 



4-. 1.3. Large separated subsets of J-^'^{p) 

In the sequel, we will be interested in large subsets of the class of densities 
■^d '^ip) ■^dip)^ whose elements are separated from one another by a given 
fixed amount, as measured by the norm || • ||l2(^^). The following lemma, whose 
proof is given in Appendix A, will be useful: 
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Lemma 3. Let r = 1/p — 1/2. Let Si, . . . ,sm, M = 2'^, be the lexicographic 
ordering of the elements ofM'^. Given a positive real parameter a and an integer 
k G {1, . . . , M - 1}, let 6(M, k, a) C E*^"^ consist of {M - 1) -dimensional real 
vectors having exactly k nonzero components, each of which is equal to either a 
or —a. With this, define the set T{k,a) C L^{^d) by 

Hk,a) = |/ : 6,,{f) = {0s,{f))2<,<M e e(M,fc,a) 

Suppose that k and a are such that 

ka<^= and a < — L(fc + 1)-!/^. (31) 
2Vm VM 



Then the following statements hold: 

1. The set J-{k,a) is contained in T'^'^{p). 

2. For any two /, /' G J^(k, a) we have 

D{f\\f') < 2Mka\ (32) 

where D{-\\-) is the Kullback-Leibler divergence (relative entropy) [8]. 

3. If k — M — 1, then there exists a set T{M — 1, a) C T{M — 1, a) with the 
following properties: 

• \\f ~ f'\\h(^,)>{M -1)0^ for all f,f' e F{M -l,a) with f ^ f 

• log \P{M - 1, a)| > (M - l)/8 

^. If M — 1 > 4fc, then there exists a set J-{k, a) C J-{k, a) with the following 
properties: 

• 11/ - f'Whi^,) > /' /' e Hk, a) with f ^ f 

• log \f{k, a) I > 0.233 k (log ^ + l) 

4-2. Proof of Theorem 1 

Let us decompose the squared L^ error as 

11/ - /RWT||i2(^_j) = ^I{s<^A{\)}{Os - OsY + ^I{seA{X)<=}Ol = Ti+ T2. 
s s 

We start by observing that any s £ necessarily satisfies Wg = 6l > Xd.m 

while for any s € A{XY there exists some 1 < k < d such that Wt^^(^s) < Afe.„, 
which implies, in turn, that ^^.(^jj < Xk,n for all t e B''"'^ and, in particular, 
that 9l < Xk^n < -^i.n- Therefore, defining the sets 

= {s e B'^ : ^ > Xd,n} and A2 = {s € M'^ : < Ai,„}, 
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we can bound Ti and T2 as 

Ti<Y,hseA,}{Os-0sf 

s 

and 

s 

Further, defining 

B = |s e B'* : 6*2 < A<j.„/2| and S* = |s G B'' : 6*2 > 3Ai,„/2|, 
we can write 

Tl < ^ I{s£AinB}i(^s — Osf' + ^ I{seAinB'=}{&s — = ^11 + ^12i 

s s 

and 

s s 

Applying (6) and (11), we get 

ET12 < ^^^^ ~ ^«)' 



sGB" 



< 



\{s : 9^^ > Xd,n/2}\ 



< 



1 / 2 ^P^' 



Mn \M\d,n 
1 ( In 



Mn V Ma 



1 



M \2'^ad 



\ p/2 



<i 

< J_„-2r-/(2r+l) 

To bound T22 we apply (8): 



£122 < ^ -^{02<3qj/2„}6's < 



A/ V 
C AogA/\ 



2r/(2r+l) 



M V n 
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In order to deal with the large-deviation terms Tn and T21, we will need some 
moment and concentration bounds which are listed in Section 4.1. First, using 
Cauchy-Schwarz, we get 



ETn < J2 y^i^s - ■ P(^i n B) (35) 

s 

To estimate the fourth moment in (35), we apply the bound (26) to Ui = 
ips{Xi) — 9s, 1 < i < n, and to = 4. Then 

EU^ ^ E(^,(X,) - 0sf - ^ - < ^ 

and 



so that 



To handle the probability of Ai n B, we first estimate 

1^, -0s\^ = {ds - Osf = el - 20 A + el>ol- 2|0s^s| + el = {\e,\ - \es\f. 

From this we conclude that 9l > Xd.n and 9^ < Xd,n/2 together imply 

\9s ~ 9s\ > -\/Xd,n = 

O o \ Ti 

(the factor of 1/5 is simply a lower bound on 1 — 1/^/2). Therefore, 

ViA,nB)<v(^\l-9,\ > • 

Applying Hoeffding's inequality (27) to Ui = Lps{Xi) ~ 9s, I < i < n, with 
b = 2/\/M and using the fact that ad > Cid/M, we get 

for some absolute constant C > 0. If Ci is chosen so that C > 2, then we will 
have 

ET„ < (36) 

Finally, 

ET2i<Y,nA2nS)9l 

s 

Using the same argument as above, we can write 

nA. n 5) < ^ 
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(with the same choice of the constant Ci as before). Then 

2 

M2 



2 1 
- M2 M ^ 

m—l 

2 M-(2'^+i) 

< 



M2 2r 

where the first inequahty uses the fact that / £ Td{p), while tlie second inequal- 
ity follows from 



M „M 



M-{2/p-l) ]^,i~2t 



2/p-l 2r 



m—l 

Since n < 2rM'^^'^^, M~^'^^'^^'^ /2r < n~^. Consequently, we will have 

E^-<^. (37) 

Putting together Eqs. (33), (34), (36), and (37), we get (20). The second bound 
(21) is proved along the same lines, except that the extra logn factor in the 
thresholds will give ETn < C/Mn and ET21 < C/Mn. 

4.3. Proof of Theorem 2 

The proof of the first part uses a popular information-theoretic technique due 
to Yang and Barron [39, 40]; we only outline the main steps. The first step is to 
lower-bound the minimax risk by the minimum probability of error in a multiple 
hypothesis test. Let Fa be an arbitrary subset of F^'^{p). Then 

inf sup E/ll/ - > inf sup E;||/ - 

In particular, suppose that the set Fq is finite, Fq — {Z'^-*, . . . , /'■^■'}, and S- 
separated in ^^(/i^), i.e., 

\\f^'^-f'^''^\\LH^,)>S, Vi,j-e{l,...,7V};z^j (38) 

Then a standard argument [40] gives 

(52 

%ll/-/llL^fu.) > 



inf sup EfWf- > ^ minP (/ ^ /(^)) , (39) 



where the random variable Z is uniformly distributed over the set {!,..., N}^ 
and the minimum is over all estimators / based on X" that take values in the 
packing set {/'■^•', . . . , Z^^-*}. Applying Fano's inequality [8], we can write 

.■■i..p(/>/-)^l-^^4™. (40) 
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where is the Shannon mutual information [8] between the random in- 

dex Z e {!,..., N} and the observations Xi, . . . , X„ ''^^ J^^-*. In this particular 
case, we have 

N 

= ^E^(/'''ii/)' (41) 

fe=l 

where / denotes the mixture density N^^ X^fci Z*'^''- The next step consists in 
upper-bounding this mutual information. To that end, suppose that there exists 
some A > 0, such that 

D{f\\f')<A, yfJ'eTo. (42) 

Using convexity of the relative entropy and (42), for every k E {!,..., N} we 
have 

1 ^ 

e=i 

Substituting this into (41), we see that I{Z\X'^) < nA. Combining this bound 
with (39), we get 



inf sup E;||/-/||i.(^^) > 



^ _ nA + log2 
log TV 



(43) 



In particular, let fc G {1, . . . , M — 1} and a > satisfy the conditions (31) of 
Lemma 3, as well as 

a' <^log\fik,a)\ (44) 

for a suitable constant C > 0, where T{k, a) are the subsets of J-^'^{p) described 
in Lemma 3. If we let To — T{k,a), then, by Lemma 3, (42) holds with A = 
_L i^IM^I ^ This, in conjunction with (43), gives 

inf sup EjWf- > CS'{k,a), (45) 

/ feJ^a 

where 

Sik, a) ^ min {ll/ - fh^i^,) : fj' € f{k, a); / ^ /'} 

is the minimal L^(/irf)-separation between any two distinct elements of J-{k,a). 
We can now consider the following cases: 



imsart-ejs ver. 2011/11/15 file: binary_hypercube.tex date: December 3, 2012 



Raginsky et al. /Recursive density estimation on the binary hypercube 21 

1. Suppose that M'^^p < n. Then we take k = M — 1 and — Because 
in this case 

kW(k + l^/P = M^/P <Vn< — 



and log |J"(M - l,a)| > {M - l)/8, the conditions (31) and (44) will be 
satisfied for a suitable choice of C. Moreover, by Lemma 3, we have 

d^(k,a) = 5^(M-l,a) > (M^l)a^ > -. 

n 

Substituting this into (43), we obtain (23). 
2. Suppose that 7i < M^^^-'^^/p for some e G (0,1). Let A: = C I " 



log M 



anda^a Then 



/cV(fc + l)i/P < {2k)^/P 

1/2 

(2C)i/P 



logM 
_ (2C)i/PC" 

If we choose C and C" in such a way that (2C)1/pC" < 1/2, then (31) will 
be satisfied. Moreover, we must have 

n ^ ' 



M - 1 > 4fc = 4C . (46) 

With our assumptions on n and Af, this will hold for all sufficiently 
large M. Next, we check that (44) is satisfied. Assuming that (46) holds. 
Lemma 3 implies that 

1 , 1^., M C , M-1 
^log|m«)l>^log — 

C , (M- l)2/PlogAf 

> log ^ — . 

- Mn ^ n 

Again, using our assumption on n and Af , as well as the fact that a? = 
'^Mn^ , we can guarantee that (44) holds, with an appropriate choice of 



C = C{p,e). By Lemma 3, we will have 
S'^ik,a) > ka^ > 



C /logA/\^ ^ C /logA/N^'+i 



M \ n J M 

and we obtain (22). 



imsart-ejs ver. 2011/11/15 file: binary_hypercube.tex date: December 3, 2012 



Raginsky et al. /Recursive density estimation on the binary hypercube 



22 



4.4. Proof of Theorem 3 

The time complexity of the algorithm is determined by the number of recursive 
calls made to RecursiveWalsh. Recall that, for each 1 < A; < d, a recursive 
call to RecursiveWalsh is made for every u eM'^ for which W„ > Xk,n- Let us 
say that a recursive call to Recursive Walsh(ii, A) is correct if W„ > Afc^„/2. 
We will show that, with high probability, only the correct recursive calls are 
made at every 1 < k < d. The probability of making at least one incorrect 
recursive call is given by 

d 

U U {Wu>\k^n,Wu<Xk,n/2} 

yk=l ueB'' 

d 

fc=l«GB'= 

For a given u G B'^, let 



Then W„ > Xk,n and Wu < \k,nl'2 together imply that 



> Wu-2^WuWu + Wu 
2 



1 



> Afe,„/25. 

Now, as shown in Appendix B, for each u e B*^, the norm — fu\\L'^(p.^^^) can 
be expressed as a supremum of an empirical process over a suitable function 
class, to which we can then apply Talagrand's bound (29) with L = 1/v^, 
V = 1/2^, and H = l/VWn. Hence, 

> Afe,„, W„ < Afe,„/2) < P(||/„ - 7n||i2(^^_^) > 

< exp {-nCi(2'=a2 „ ^ 2^/2afc,„)} , 
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where for each I < k < d, 




Here, Ci and C2 are the absolute constants in Talagrand's bound (29). Given 
J > 0, if we choose at according to (24), then 

nWu>Xk,n,Wu<Xk,n/2)<^, V?/eB^ 

Summing over I < k < d and u e B*^, we see that, with probabiUty at least 
1 — S, only the correct recursive calls will be made. 

Next, we give an upper bound on the number of the correct recursive calls. 
For each I < k < d, > Afc^„/2 implies that there exists at least one v £ B'^^'^ 
such that 9'^^ > Xk,n/'^- Since for every 1 < k < d each dg contributes to exactly 
one Wu, we have by the pigeonhole principle that 

|{?/eB'=:W:„>Afc,„/2}| < 

< 

where in the second line we used (6). Hence, the number of the correct recursive 
calls in Algorithm 1 is bounded by 

At each recursive call, we compute an estimate of the corresponding Wuo and 
Wui, which requires 0{n^d) operations. Therefore, with probability at least 
1 — (5, the time complexity of the algorithm is given by (25). 

5. Simulations 

Although an extensive empirical evaluation is outside the scope of this paper, 
we have implemented the proposed estimator, and now present some simulation 
results to demonstrate its small-sample performance on synthetic data in low- 
and high-dimensional regimes. In the low-dimensional regime, it is feasible to 
obtain the "ground truth" by exhaustively computing all the 2"^ Walsh coeffi- 
cients and to compare it with our estimate. In the high-dimensional regime, our 
comparison is based on the density values at randomly generated samples. Addi- 
tionally, we present a number of computational strategies that greatly enhance 
computational efficiency in the high-dimensional regime. 



{seB'*:02> Afc^„/2} 
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5.1. Low- dimensional simulations 

We generated synthetic observations from a mixture density / on a 15- 
dimensional binary hypercube. The mixture has 10 components, where each 
component is a product density with 12 randomly chosen covariates having 
Bernoulh(l/2) distributions, and the other three having Bernoulli(0.9) distribu- 
tions. For d 15, it is still feasible to quickly compute the ground truth, con- 
sisting of 32768 values of / and its Walsh coefficients. These values are shown 
in Fig. 1 (left). As can be seen from the coefficient profile in the bottom of the 
figure, this density is clearly sparse. Fig. 1 also shows the estimated probabilities 
and the Walsh coefficients for sample sizes n = 5000 (middle) and n = 10000 
(right). 

To study the trade-off between MSE and complexity, we implemented three 
different thresholding schemes: (1) constant, Afc „ = 2/(2'^n), (2) logarithmic, 
Xk,n = 21og(d - A: + 2)/(2''n), and (3) linear, Afc,„ = 2{d - k + l)/(2'^n). The 
thresholds at fc = d arc set to twice the variance of the empirical estimate of 
any coefficient whose value is zero; this forces the estimator to reject empirical 
coefficients whose values cannot be reliably distinguished from zero. Occasion- 
ally, spurious coefficients get retained, as can be seen in Fig. 1 (middle) for the 
estimate for n = 5000. Fig 2 shows the performance of Jkwt- Fig. 2(a) is a 
plot of MSE vs. sample size. In agreement with the theory, MSE is the small- 
est for the constant thresholding scheme [which is simply an efficient recursive 
implementation of a term- by-term thresholding estimator with A„ ^ 1/Mn], 
and then it increases for the logarithmic and for the linear schemes. Fig. 2(b,c) 
shows the running time (in seconds) and the number of recursive calls made to 
RecursiveWalsh vs. sample size. The number of recursive calls is a platform- 
independent way of gauging the computational complexity of the algorithm, 
although it should be kept in mind that each recursive call has O(n^d) over- 
head. Also, the number of recursive calls depends on whether a binary or N-avy 
tree is utilized. The iV-ary tree scheme is explained in detail below, in Section 

5.2. We have used N — 256 in the simulations, as this setting leads to much 
reduced computations times vs. a binary tree. 

The running time increases polynomially with n, and is the largest for the 
constant scheme, followed by the logarithmic and the linear schemes. We see 
that, while the MSE of the logarithmic scheme is fairly close to that of the con- 
stant scheme, its complexity is considerably lower, in terms of both the number 
of recursive calls and the running time. In all three cases, the number of re- 
cursive calls decreases with n due to the fact that weight estimates become 
increasingly accurate with n, which causes the expected number of false discov- 
eries (i.e., making a recursive call at an internal node of the tree only to reject 
its descendants later) to decrease. Finally, Fig. 2(d) shows the number of coef- 
ficients retained in the estimate. This number grows with n as a consequence 
of the fact that the threshold decreases with n, while the number of accurately 
estimated coefficients increases. 

Additionally, we have performed comparisons between our proposed method 
and two alternatives: the Ott and Kronmal thresholding estimator [28]; and 
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an exhaustive search over all possible thresholds for the best MSE. As seen in 
Fig. 3, our thresholding estimator provides close to the best possible MSE with 
far lower computational cost than the alternatives. 

5.2. High- dimensional simulations 

Although our algorithm has been implemented in MATLAB and therefore can be 
much further optimized for speed, we have devised several strategies for travers- 
ing the coefficient tree efficiently and circumventing computer-architecture re- 
lated challenges for high-dimensional problems. We describe those strategies and 
demonstrate them using the same experimental set-up as above, but with much 
larger dimensionality. 

• Direct computation of the coefficients near the leaves of the tree. 

As discussed in 2.3, the direct estimator (15) for Wu becomes computa- 
tionally more efficient than the indirect estimator (14) for k > d—\og{nd). 
Hence, near the bottom of the tree, instead of continuing to traverse the 
remaining levels based on the weights we simply compute all coef- 
ficients 6u at the leaves of the corresponding subtree. This is always the 
most sensible course of action, given the fact that (15) requires the 9u 
anyway. 

• N-ary tree. The number of levels to be traversed can be reduced by a 
factor of log N by considering an A^-ary, rather than a binary, tree, at the 
cost of an increased number (N) of branches per level. We have found 
this trade-off to be worthwhile in many cases due to the possibility of 
vectorizing the computation of Wu for all the branches in each level and 
taking advantage of optimized routines for matrix algebra. 

• Open-node queue. While our algorithm lends itself to recursive imple- 
mentation, many computer/operating system architectures impose a hard 
limit on the recursion level due to stack-size restrictions (recursive func- 
tion calls typically use stack memory). This becomes a problem when the 
dimension d is high and the tree is accordingly very deep. We have circum- 
vented the issue by implementing a queue system where the so-called open 
nodes (the tree branches awaiting processing) are sorted according to some 
criterion. This amounts to transferring the "stack" to user memory, where 
the only limit on the number of nodes is the free memory size. Possible 
criteria for sorting the nodes include depth-first, breadth-first and high- 
est weight Wu- We have used the latter criterion in our high-dimensional 
simulations. 

• Pruning high-frequency Walsh coefficients. For a given Walsh func- 
tion ifs, the Hamming weight (i.e., number of "on" bits) of s is a measure 
of frequency; higher- frequency coefficients have a higher proportion of ones 
in s. Because in many problems it is appropriate to assume that the signals 
of interest have low frequency, we have included the ability to impose a 
limit on the order of the Walsh coefficients by ignoring any branches with 
more than m "on" bits. The choice of m depends on the problem context 
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and on computational resources. 
• Weight-adaptive thresholding. For some datasets, significant gains can 
be achieved by varying the thresholds Xk,n in a data-driven manner at 
each level of the tree; as an alternative to the preset schedules ak in 
(16), it is possible to take the weights Wu for each branch at level k, 
and then expand only the top q branches with the highest Wu- This is 
equivalent to making ak not only level-dependent, but also dependent on 
the sequence of weights at level k. The value of q controls the trade-off 
between computation speed and accuracy. 

The first three strategies are important modifications to a naive implementation 
of our algorithm, but in no way impact the MSB. The latter two techniques, 
however, provide an approximation to the estimator proposed and analyzed in 
this paper; for appropriate values of m and q, they yield significant computa- 
tional savings for a modest increase in MSB. 

In Figure 4, we present plots of the MSB and computation time for simu- 
lated data with d — 50 and multiple sample sizes (n), using the aforementioned 
optimizations. As before, the data were generated from a Bernoulli mixture den- 
sity, similar to the one used for Figure 2 but using ten 50-dimensional mixture 
components, where each component has 47 covariates with a Bernoulli (1/2) dis- 
tribution and three covariates with a Bernoulli(0.9) distribution. The results 
are averaged over ten independent runs. We have limited the number of "on" 
bits in the Walsh binary strings to three and eight {i.e., m = 3 and to = 8 
respectively), expanded only the 16 subtrees with highest Wu at each level (i.e., 
q — 16) and used an A^-ary tree with N — 256. The subtrees in the open-node 
queue were sorted by decreasing Wu- Even in this high-dimensional regime, we 
achieve steadily decreasing MSB as a function of n, as well as approximately lin- 
ear scaling in computation time. It is also apparent that setting to = 3 achieves 
essentially the same MSB but with an order-of-magnitude reduction in the com- 
putational effort. 

6. Summary and conclusion 

We have presented a computationally efficient adaptive procedure for estimating 
a multivariate binary density in the "big d, small n" regime, which essentially 
forces a "nonparametric" approach. Many problems of current practical inter- 
est that involve multivariate binary data seem to pertain to populations with 
certain "constellation" effects among the d covariates. We have formalized this 
observation by focusing on a class of densities whose Walsh representations ex- 
hibit a certain power-law behavior. For moderate sample sizes, our estimator 
attains nearly minimax rates of MSB convergence over this class and runs in 
polynomial time with high probability. Moreover, the complexity improves for 
sparser densities. We have also reported the results of simulations, which show 
that our implemented estimator behaves in accordance with the theory even 
in the small-sample regime. In the future, we plan to test our method on real 
high-dimensional data sets. Another promising future direction is to investigate 
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the relationship between various smoothness classes of densities on the binary 
hypercube defined in terms of their Fourier-Walsh representations and proba- 
bility densities of binary Markov random fields [23]. Such a density will have 
the form 

i=l 

where for each i G {1, . . . ,d} we have a neighborhood Ni C {!,..., 
the corresponding "local energy" function hi{-) depends only on cc*^*^ and on 
x^' = (a;(^^ : j G Ni), and Z is the normalization constant known as the 
partition function. It is reasonable to assume that if most of the neighborhoods 
Ni are small, then most of the Fourier- Walsh coefficients of / will be small as 
well. Assuming specific bounds on the decay of the Fourier-Walsh coefficients 
of / amounts to assuming something about the decay of correlations in the 
Markov random field governed by /. It remains to be seen whether sparsity 
classes of the type investigated in this paper can serve as a good model of 
binary Markov random fields with polynomial decay of correlations, or whether 
one would need to introduce a binary analog of something like (weak) Besov 
bodies [3] in order to account for localization both in space (small number of 
i's with large neighborhoods) and in frequency (small number of large Fourier- 
Walsh coefficients). In either case, it would be worthwhile to investigate the 
use of recursive thresholding estimators of the type introduced in this paper to 
estimate Markov graphical models on the binary hypercube. 
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Appendix A: Auxiliary proofs 
A.l. Proof of Lemma 1 

Using the appropriate definitions, as well as the factorization property (3) of 
the Walsh functions, we have 

E/ {(/3n(7rfe(X))/{^^(X)=y}} = fi^)'P^i^k{x))I{a^(^)^y} 

= ^ ^ Ovw'fvwizy) \(pu{z) 

zgB'= \ (■u,t«)eB'=xB<'-'= / 

weB''-'' 

= fu{y)- 



Similarly, 



Wu= J2 fu(y) 

= J2 E{¥'«(T.(^))/{.,(x)=,}}/u(y) 

y^B'^-'' 

= Y f i^)'Pu{T^k{x))fu{y)I{a,(x)=y} 



= E/{(^„(^fc(X))/„K(X))}, 
and the lemma is proved. 

A . 2. Proof of Lemma 2 

We begin by showing that, for any 1 <k < d and any m G B'^, 

W^^W^^ + Wui. (A.l) 
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From the factorization properties of the Walsh functions, we have 

^ n n 

= ■Z^'^'^'f-^o{-^k+l{Xi))ipuo{TTk+l)Xj))I{„^^^(^X,)=a, + ,{X,)} 



i=i ]=i 



and 



^ n n 
i=l j = l 



^ n 71 
i=l j = l 



^ n n 



i=l 3 = 1 

X (-1) ' ^{^fc + i(X.)=a, + i(X,)}- 

Adding these two expressions and using the fact that 
we get 

^ n n 

WuO + Wul = -l^^^ipuiTTkiX^))tpuiTrkiXj)) 



^ n n 

-^'^^fu{TTk{Xij)ipu{Trk{Xj))I{, 



= R. 

This proves (A.l). By induction, we have 

Wu= J2 

where C{u) denotes the set of all leaves descending from u. Since C{u) — {uv € 
B'^ ; w e B'^-'^} and since 



^ n n 

the lemma is proved. 



i=i j=i 
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A. 3. Proof of Lemma 3 

We first consider Item 1. By construction, for any / G T{k^a) we have 



1 

M 



M 



J=2 



and 



M 



M 



i=2 



1 

M 



1 

M 



ka 



ka 



Thus, if k and a satisfy the first condition in (31), then all / G J^{k,a) will be 
bounded between 1/2M and 3/2M. To see that the second condition in (31) 
implies T{k, a) C J^d{p), observe that in that case the Fourier- Walsh coefficients 
of / ordered according to decreasing magnitude are 



1 

/M' 



m — 1 

l^(m)(/)l = <( a, m = 2,...,fc + l 
0, m = k + 2,...,M 

from which it follows that / £ ^d{p)- Finally, any / g F{k,a) is also a proba- 
bility density because it is nonnegative and because 



E f{x) = (/, 1) = yM(/, ^o) = VM0o(/) = 1 



This shows that T{k,a) C J'^'^{p), as claimed. 

We now move on to Item 2. To prove that the bound (32) holds for any two 
densities /, /' G F{k,a), we use the fact that the KuUback-Leibler divergence 
D{f\\f') is bounded above by the chi-square distance: 



\f{x)~r{x)\' 
fix) 



(A.2) 



For any a: e B'' and /, /' e ^(fc, a), we have 



\f{x)-nx)r 



M 



< 



2ka^ 



Using this together with the fact that /, /' > l/2Af in (A.2), we arrive at (32). 

For Item 3, we need the following well-known combinatorial result from cod- 
ing theory, the so-called Varshamov-Gilbert bound (see, e.g.. Lemma 4.7 in [26]): 
For any m G N, there exists a subset JCm of B™ with the following properties: 
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• for any u,v G ICm with u v, 

m 

^I{ui^)^vi^)} > to/4 

• log |/C„| > m/8 

There is a one-to-one correspondence between J-{M — 1, a) and the binary hy- 
percube B^^^^, under which each / e J^(M — l,a) is mapped to Uf £ B*^^^ 
with 

(i) 

u)' (/)=«}, j = l,...,Af-l. 

Given the Varshamov-Gilbert set JCm-i C B*^^^, let 

f{M-l,a) = {/ e J"(M- l,a) : € /Cm-i}- 
Then for any two /, /' e J^{AI — 1, a), we have 

M-l 

and log I J"(M - 1, a)| = log |/Cm-i| > (M - l)/8. This takes care of Item 3. 

Finally, we consider Item 4. To that end, we will need a refinement of the 
Varshamov-Gilbert bound due to Reynaud-Bouret [29] , which for our purposes 
can be stated as follows (see Lemma 4.10 in [26]): For any m, fc e N with 
m > A:, let BJI' denote the subset of B™ that consists of all u e B™ with 
J27LiI{u(i)=i} = k. E m > 4fc, then there exists a set ICm,k C B™ with the 
following properties: 

• for any u,v £ Km.k with u ^ v, 

m 
i=l 

• log\JC,n,k\ > 0.233A:log(m/fc) 

Now assume that A/ — 1 > 4fc and consider the corresponding set K.M~i,k- To 
each u G ICM-i,k we can associate 2'^ elements of J"(M — l,k), say {fu,v '■ 
V G B*^}, such that u determines the locations of the k nonzero Walsh-Fourier 
coefficients taking values in {—a, a}, while the choice of w S B*^ determines the 
signs of these coefficients. Thus, let 

J"(/c, a) = |/„,i, : w e K:,n,k,v G B'=|. 

Consider now any two f'^'^,fu',v' G P{k,a), such that {u,v) ^ {u\v'). For any 
j G 1, . . . , A/ - 1 such that u^-^') ^ u"^^\ either u^^'> = or = 0. Suppose the 
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latter. Then \Os^^.i{fu,v)\ = a, while lOs^+Afw .v')\ = 0. Consequently, 

M-l 

\\fu,v - /u',i;'|Il2(^^) ^ I{uU)^u'U)} > ko^ 



and 



log|J"(fc, a)| = /clog2 + log |/CM-i,fc| 

M - 1 

> A; log 2 + 0.233fc log — - — 



> 0.233fc ^log + 1 

This proves Item 4. 

Appendix B: Empirical process representation 

In this appendix, we show that for each u G B'"', the norm ||/u — fu\\L^{iici-k) '^^"^ 
be expressed as a supremum of an empirical process over a suitable function 
class; this was a key element of the proof of Theorem 3. 

First, we show that \\fu~fu\\L^(^i^_t,) can be expressed as an empirical process 
of the form (30) indexed by a suitable function class. To this end, define 



Then 



Wfu - fuWmp^^k) = sup Vr,{g). (B.l) 



Indeed, let Xi, . . . , X„ be n i.i.d. copies oi X ^ f . Then 

1 " 
n ^-^ 

i=l 
1 " 

= E i^uv - Ouv)Civ) 



where in the last line we used Cauchy-Schwarz. This proves (B.l). Next, we 
determine the constants L, v and H that are needed to apply Talagrand's bound 
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(29). For any ^ e 



{iJLd-k) with unit norm, we have 




Hence, any g £ !F is bounded by L = 2~'^/^. From this, we also get the bound 
Var(7 < V with v = L"^ = . Finally, to bound the Rademacher average (28), 
we note that F is the unit ball in the RKHS induced by the kernel 



Then standard arguments (see, e.g., Section 2.4.2 in [27]) lead to the bound 



which gives H = l/yOFn. 
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Fig 1. Ground truth (top) and estimated density for n = 5000 (middle) and n — 10000 
(bottom) with constant thresholding. Left column: true and estimated prohabtlittes 
(clipped at zero and renormalized) arranged m lexicographic order. Right column: ab- 
solute values of true and estimated Walsh coefficients arranged in lexicographic order. 
For the estimated densities, the coefficient plots also show the threshold level (dotted 
line) and absolute values of the rejected coefficients (lighter color). 
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Fig 2. Small-sample performance o/ /rwt in estimating f with three different thresh- 
olding schemes: (a) MSE; (b) running time (in seconds); (c) number of recursive calls; 
(d) number of coefficients retained by the algorithm. All results are averaged over five 
independent runs for each sample size (the error bars show the standard deviations). 
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Fig 3. Comparison o//rwt with the Ott and Kronmal (O&K) estimator [28] and with 
an exhaustive search for the best MSE. The plots show: (a) /rwt MSE; (h) /rwt 
times; (c) O&K MSE; (d) O&K times; (e) Exhaustive search MSE; (f) Exhaustive 
search times. All results are averaged over ten independent runs for each sample size 
(the error bars show the standard deviations). 
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Fig 4. Performance for a large-dimensional problem (d — 50): (a) Log~MSE; (b)- 

(c) running time (in seconds) for m = 3 and m — 8, respectively. All results are 

averaged over ten independent runs for each sample size (the error bars show the 

standard deviations). 
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